
mainLink='https://github.com/DACSS-CSSmeths/Spatial-Exploring/raw/refs/heads/main/'
tobLink='dataCIA/Tobacco_use.csv'
alcLink='dataCIA/Alcohol_consumptionPerCapita.csv'
obeLink='dataCIA/Obesity_adultPrevalenceRate.csv'
popLink='dataCIA/Population_total.csv'
areaLink='dataCIA/Area.csv'
import pandas as pd
tobDat=pd.read_csv(mainLink+tobLink)
alcDat=pd.read_csv(mainLink+alcLink)
obeDat=pd.read_csv(mainLink+obeLink)
popDat=pd.read_csv(mainLink+popLink)
areaDat=pd.read_csv(mainLink+areaLink)
# num of rows
tobDat.shape[0],alcDat.shape[0], obeDat.shape[0],popDat.shape[0],areaDat.shape[0]
(164, 189, 192, 237, 256)
As you can see, the number of countries (rows) differs from one data set to another.
The data merging
We need a single data frame, so let's merge all of them. The result has problematic column names:
allCia=obeDat.merge(tobDat.merge(alcDat,on='name'),on='name')
allCia.columns
Index(['name', 'obesityAdults_rate', 'region', 'TobaccoUse_perc', 'region_x',
'Alcohol_LitersPerCap', 'region_y'],
dtype='object')
Notice that preprocessing concepts are always needed in any project. In this case:
- The name column appears only once: it was the key column for merging the three data sets.
- region was not a key column, but it appeared in all the data frames, so the column is repeated with a suffix (_x, _y). Notice that _x refers to the left object in the merge() call: left.merge(right).
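If you prefer clearer names than _x/_y, merge() accepts a suffixes argument. A minimal sketch with made-up frames and values:

```python
import pandas as pd

# toy frames that share the non-key column 'region' (made-up values)
left = pd.DataFrame({'name': ['Peru', 'Chile'],
                     'region': ['SA', 'SA'],
                     'obesity': [20.0, 28.0]})
right = pd.DataFrame({'name': ['Peru', 'Chile'],
                      'region': ['SA', 'SA'],
                      'tobacco': [9.0, 30.0]})

# the default suffixes are ('_x', '_y'); here we pick descriptive ones
merged = left.merge(right, on='name', suffixes=('_obe', '_tob'))
print(merged.columns.tolist())
# → ['name', 'region_obe', 'obesity', 'region_tob', 'tobacco']
```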
We will simply subset in this case:
# subsetting
allCia=allCia.loc[:,['name','region','obesityAdults_rate','TobaccoUse_perc','Alcohol_LitersPerCap']]
# check data types
allCia.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162 entries, 0 to 161
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  162 non-null    object 
 1   region                162 non-null    object 
 2   obesityAdults_rate    162 non-null    float64
 3   TobaccoUse_perc       162 non-null    float64
 4   Alcohol_LitersPerCap  162 non-null    float64
dtypes: float64(3), object(2)
memory usage: 6.5+ KB
Notice that the resulting data frame has even fewer countries than any of the originals after the merge. You would need to apply a fuzzy merge if you want to recover some rows. The good news is that the columns are in the right data type. Please review my previous work if these two concepts are not familiar to you.
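Before reaching for a fuzzy merge, it helps to see exactly which rows failed to match; an outer merge with indicator=True reports that. A minimal sketch with made-up country spellings:

```python
import pandas as pd

# toy name columns where one country is spelled differently
a = pd.DataFrame({'name': ['Peru', 'Bolivia', 'Viet Nam']})
b = pd.DataFrame({'name': ['Peru', 'Bolivia', 'Vietnam']})

# '_merge' tells whether a row came from the left, the right, or both
check = a.merge(b, on='name', how='outer', indicator=True)
print(check[check['_merge'] != 'both'])
```

The rows flagged left_only/right_only are the candidates for fuzzy matching.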
Let me prepare another merge:
poparea=popDat.merge(areaDat, on=['name','region'])
poparea
| | name | Population_total | region | area_SqKm |
|---|---|---|---|---|
| 0 | China | 1,416,043,270 | East and Southeast Asia | 9,596,960 |
| 1 | India | 1,409,128,296 | South Asia | 3,287,263 |
| 2 | United States | 341,963,408 | North America | 9,833,517 |
| 3 | Indonesia | 281,562,465 | East and Southeast Asia | 1,904,569 |
| 4 | Pakistan | 252,363,571 | South Asia | 796,095 |
| ... | ... | ... | ... | ... |
| 232 | Tokelau | 1,647 | Australia and Oceania | 12 |
| 233 | Paracel Islands | 1,440 | East and Southeast Asia | 8 |
| 234 | Holy See (Vatican City) | 1,000 | Europe | 0 |
| 235 | Cocos (Keeling) Islands | 593 | Australia and Oceania | 14 |
| 236 | Pitcairn Islands | 50 | Australia and Oceania | 47 |
237 rows × 4 columns
Notice that I used TWO keys this time. This avoids the previous column suffixes. Let's see the data types:
poparea.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 0 to 236
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   name              237 non-null    object
 1   Population_total  237 non-null    object
 2   region            237 non-null    object
 3   area_SqKm         237 non-null    object
dtypes: object(4)
memory usage: 7.5+ KB
As you may have guessed, the numeric columns are not properly recognized as such. You may need to delete the commas and turn the columns into a numerical data type, or just reload the data with other arguments:
popDat=pd.read_csv(mainLink+popLink, thousands=',')
areaDat=pd.read_csv(mainLink+areaLink, thousands=',')
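For reference, the comma-deletion route mentioned above could be sketched like this (toy data, assuming the column was read as text):

```python
import pandas as pd

# a numeric column stored as text because of thousands separators
df = pd.DataFrame({'Population_total': ['1,416,043,270', '341,963,408']})

# drop the commas, then cast to a numeric type
df['Population_total'] = (df['Population_total']
                          .str.replace(',', '', regex=False)
                          .astype('int64'))
print(df['Population_total'].dtype)
# → int64
```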
Now we have it right!
poparea=popDat.merge(areaDat, on=['name','region'])
poparea.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 0 to 236
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   name              237 non-null    object
 1   Population_total  237 non-null    int64 
 2   region            237 non-null    object
 3   area_SqKm         237 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 7.5+ KB
Geo Merging
Remember we created these maps the previous week:
import geopandas as gpd
mapsLink=mainLink+'maps/worldMaps_Py.gpkg'
# just the list of maps
gpd.list_layers(mapsLink)
| | name | geometry_type |
|---|---|---|
| 0 | countries_poly | MultiPolygon |
| 1 | rivers_line | MultiLineString |
| 2 | cities_point | Point |
Let's open the countries_poly map:
countries=gpd.read_file(mapsLink, layer='countries_poly')
countries.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   COUNTRY   252 non-null    object  
 1   geometry  252 non-null    geometry
dtypes: geometry(1), object(1)
memory usage: 4.1+ KB
Let's see some contents:
countries.head()
| | COUNTRY | geometry |
|---|---|---|
| 0 | Aruba (Netherlands) | MULTIPOLYGON (((-69.88223 12.41111, -69.94695 ... |
| 1 | Antigua and Barbuda | MULTIPOLYGON (((-61.73889 17.54055, -61.75195 ... |
| 2 | Afghanistan | MULTIPOLYGON (((61.27656 35.60725, 61.29638 35... |
| 3 | Algeria | MULTIPOLYGON (((-5.15213 30.18047, -5.13917 30... |
| 4 | Azerbaijan | MULTIPOLYGON (((46.54037 38.87559, 46.49554 38... |
This map has no interesting information beyond the geometry, which, as we know, serves to plot the polygons.
import matplotlib.pyplot as plt
countries.plot()
plt.show();
Then, let's add the population data to the map.
Notice that when geo-merging, the map object has to be the LEFT object, GEO_DF.merge(DF); otherwise the result is a plain pandas data frame and the geometry column loses its geospatial behavior.
mapPopu=countries.merge(poparea, left_on='COUNTRY', right_on='name')
mapPopu.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 174 entries, 0 to 173
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   COUNTRY           174 non-null    object  
 1   geometry          174 non-null    geometry
 2   name              174 non-null    object  
 3   Population_total  174 non-null    int64   
 4   region            174 non-null    object  
 5   area_SqKm         174 non-null    int64   
dtypes: geometry(1), int64(2), object(3)
memory usage: 8.3+ KB
Notice that the merge now has fewer COUNTRIES. You should consider a fuzzy merge to reduce missing values.
Let me color the polygons using the population data:
mapPopu.plot(column='Population_total',legend=True)
Is it that simple? Not really. Of course, it is not a difficult task. After you have read the recommended reading (Herries, 2018), you should know that population is a count, an example of an extensive statistic, and we should process the data so that it represents an intensive statistic when plotted. Let's compute population density, as density is an intensive statistic:
mapPopu['popDensity']=mapPopu.Population_total/mapPopu.area_SqKm
Choropleths
Are we ready for a choropleth? The popDensity we just created is an intensive statistic. Let's see its distribution:
import seaborn as sea
sea.kdeplot(data=mapPopu, x="popDensity")
sea.rugplot(data=mapPopu, x="popDensity",height=.1)
plt.show()
There you see that outliers are present; let's sort the data:
mapPopu.sort_values(by=['popDensity'],ascending=False)
| | COUNTRY | geometry | name | Population_total | region | area_SqKm | popDensity |
|---|---|---|---|---|---|---|---|
| 97 | Monaco | MULTIPOLYGON (((7.39161 43.72755, 7.3909 43.74... | Monaco | 31813 | Europe | 2 | 15906.500000 |
| 139 | Singapore | MULTIPOLYGON (((103.99054 1.38329, 103.99795 1... | Singapore | 6028459 | East and Southeast Asia | 719 | 8384.504868 |
| 11 | Bahrain | MULTIPOLYGON (((50.65055 26.24472, 50.61305 26... | Bahrain | 1566888 | Middle East | 760 | 2061.694737 |
| 101 | Malta | MULTIPOLYGON (((14.51972 35.8, 14.42389 35.818... | Malta | 469730 | Europe | 316 | 1486.487342 |
| 103 | Maldives | MULTIPOLYGON (((72.97998 7.02778, 72.98272 7.0... | Maldives | 388858 | South Asia | 298 | 1304.892617 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 63 | Guyana | MULTIPOLYGON (((-60.08087 5.16151, -60.08195 5... | Guyana | 794099 | South America | 214969 | 3.694016 |
| 68 | Iceland | MULTIPOLYGON (((-22.02472 64.41888, -22.02444 ... | Iceland | 364036 | Europe | 103000 | 3.534330 |
| 9 | Australia | MULTIPOLYGON (((142.27997 -10.26556, 142.21053... | Australia | 26768598 | Australia and Oceania | 7741220 | 3.457930 |
| 168 | Namibia | MULTIPOLYGON (((14.52485 -22.69207, 14.5274 -2... | Namibia | 2803660 | Africa | 824292 | 3.401295 |
| 94 | Mongolia | MULTIPOLYGON (((116.6718 46.32724, 116.58554 4... | Mongolia | 3281676 | East and Southeast Asia | 1564116 | 2.098103 |
174 rows × 7 columns
Outliers will make the coloring challenging:
# find Monaco and Singapore!
mapPopu.plot(column='popDensity',figsize=(15,25))
Given this behavior, we could experiment with some transformation:
import numpy as np
#log of var
mapPopu['popDensity_log10']=np.log10(mapPopu.popDensity)
mapPopu.plot(column='popDensity_log10',figsize=(15,25))
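One caveat about the log: log10 is undefined for zero or negative values, and a territory whose area was recorded as 0 would already have produced an infinite density. A small defensive sketch with made-up densities:

```python
import numpy as np
import pandas as pd

# toy densities, including a zero that would break the log
dens = pd.Series([15906.5, 3.4, 0.0])

# mask non-positive values first, so the plot receives NaN instead of -inf
dens_log = np.log10(dens.where(dens > 0))
print(dens_log.round(2).tolist())
```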
As I shared at the beginning, we need some thinking before coloring the polygons, so that the result is an interesting and useful choropleth.
A Choropleth for CONTINUOUS values
Let me use the other variables we have to create a new map:
countriesCIA=countries.merge(allCia, left_on='COUNTRY', right_on='name')
countriesCIA.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   COUNTRY               147 non-null    object  
 1   geometry              147 non-null    geometry
 2   name                  147 non-null    object  
 3   region                147 non-null    object  
 4   obesityAdults_rate    147 non-null    float64 
 5   TobaccoUse_perc       147 non-null    float64 
 6   Alcohol_LitersPerCap  147 non-null    float64 
dtypes: float64(3), geometry(1), object(3)
memory usage: 8.2+ KB
The merge gave us again a GeoDF, let's see the distribution of Tobacco use:
sea.kdeplot(data=countriesCIA, x="TobaccoUse_perc")
sea.rugplot(data=countriesCIA, x="TobaccoUse_perc",height=.1)
plt.show()
Now the map:
countriesCIA.plot(column='TobaccoUse_perc',legend=True)
plt.show();
The variable has already been transformed and represents an intensive statistic. But this is not the best we can do (review the recommended reading):
- We are allowing the program to choose the color palette.
- We are not clarifying that we have missing values.
- The axes coordinates may not be needed.
- We have too many shades of colors. Discretization may be needed.
About color palettes
The default is viridis:
countriesCIA.plot(column='TobaccoUse_perc',legend=True,cmap='viridis')
plt.show();
Of course there are other palettes.
countriesCIA.plot(column='TobaccoUse_perc',legend=True,cmap='magma')
plt.show();
Remember to choose a palette that DOES NOT have red and green simultaneously; colorblind people have trouble when both are present.
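To stay on the safe side, you can restrict yourself to sequential, perceptually uniform palettes. The shortlist below is my own assumption, not an official recommendation:

```python
import matplotlib

# palettes generally regarded as colorblind-friendly; looking them up in
# the colormap registry also validates the spelling of the names
safer = ['viridis', 'magma', 'cividis', 'plasma']
for name in safer:
    cmap = matplotlib.colormaps[name]  # raises KeyError for unknown names
    print(name, cmap.N)
```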
About missing values
You should inform the reader of what is missing. A good strategy is to create another map with only the borders:
countries.dissolve()
| | geometry | COUNTRY |
|---|---|---|
| 0 | MULTIPOLYGON (((-159.77335 -79.86084, -159.924... | Aruba (Netherlands) |
The cell 'Aruba' could be changed to 'world', but that is not really needed. Let me go ahead, create the new map, and use it as a base layer.
worldBorders=countries.dissolve() # saving the dissolved map
# plotting two layers:
base=worldBorders.plot(color='lightgrey') # layer 1
countriesCIA.plot(column='TobaccoUse_perc',legend=True,cmap='magma',
                  ax=base) # layer 2, on top of layer 1
base.set_title("Tobacco Use")
plt.show();
About axes coordinates
Coordinates are important, but in some cases they can be omitted. You need to evaluate that on a case by case basis.
base=worldBorders.plot(color='lightgrey')
countriesCIA.plot(column='TobaccoUse_perc',
legend=True,
cmap='magma',
ax=base)
# here!
base.set_title("Tobacco Use - (non discretized)")
base.set_axis_off()
plt.show();
I will share the code to create this last map in R too.
A Choropleth for DISCRETIZED values
Discretizing is about cutting the variable into intervals. This is a common operation, which histograms use to compute the width of each bin:
countriesCIA.loc[:,'TobaccoUse_perc'].hist()
plt.show();
For histograms, 10 bins is the default, and generally that is a good idea; but in maps we should try 3 or 5.
The binning schemes
Let me compute 5 bins using different schemes:
## make sure you have
# !pip show pysal mapclassify numpy
import mapclassify
import numpy as np
np.random.seed(12345) # so we all get the same results!
# let's try 5 bins
K=5
theVar=countriesCIA.TobaccoUse_perc
# same interval width, easy interpretation
ei5 = mapclassify.EqualInterval(theVar, k=K)
# interval width based on the standard deviation; easy to read (though less so than the previous one), poor under high skewness
msd = mapclassify.StdMean(theVar)
# interval width varies, counts per interval are close, not easy to grasp, repeated values complicate cuts
q5=mapclassify.Quantiles(theVar,k=K)
# based on similarity, good for multimodal data
mb5 = mapclassify.MaximumBreaks(theVar, k=K)
# based on similarity, optimizer
fj5 = mapclassify.FisherJenks(theVar, k=K)
# based on similarity, optimizer
jc5 = mapclassify.JenksCaspall(theVar, k=K)
# based on similarity, optimizer
mp5 = mapclassify.MaxP(theVar, k=K)
###### based on similarity, good for skewed data
ht = mapclassify.HeadTailBreaks(theVar) # no K needed
How can we select the right classification? Let me use the Absolute Deviation around Class Medians (ADCM) to make the comparison:
class5 = ei5,msd, q5,mb5, ht, fj5, jc5, mp5
# Collect ADCM for each classifier
fits = np.array([ c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms['classifier'] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ['ADCM', 'Classifier']
Now, plot the adcms:
adcms.sort_values('ADCM').plot.barh(x='Classifier')
plt.show();
As you see, the best scheme (lowest ADCM) is 'Fisher Jenks'. Then, we can request that in the plot:
base=worldBorders.plot(color='lightgrey',
figsize=(15, 10))
countriesCIA.plot(column='TobaccoUse_perc', cmap='viridis', ax=base,
scheme="fisher_jenks",
linewidth=0.,
legend=True,
legend_kwds={"title": "TOBACCO USE (%)\n(missing data in grey)"}
)
leg = base.get_legend()
leg.set_bbox_to_anchor((0.2, 0.6)) #(left, bottom)
base.set_axis_off()
plt.show();
Above you see the bin limits that Fisher Jenks gave you.
# the code assigned to each country
fj5.yb
array([2, 2, 2, 2, 2, 3, 2, 1, 2, 1, 0, 2, 2, 4, 0, 4, 1, 0, 3, 4, 1, 4,
1, 1, 1, 2, 0, 2, 2, 3, 0, 2, 0, 0, 1, 4, 1, 0, 1, 2, 2, 3, 0, 0,
0, 2, 2, 3, 3, 0, 2, 3, 0, 1, 0, 4, 3, 1, 3, 1, 2, 2, 1, 2, 0, 4,
1, 2, 4, 1, 2, 3, 4, 4, 3, 0, 3, 2, 2, 3, 3, 3, 0, 0, 1, 2, 0, 2,
0, 2, 3, 1, 2, 1, 0, 1, 0, 2, 1, 3, 4, 1, 1, 0, 2, 2, 0, 2, 4, 0,
1, 3, 2, 3, 1, 1, 2, 2, 0, 2, 1, 1, 3, 4, 2, 2, 2, 3, 0, 0, 2, 4,
0, 0, 0, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 4])
# the frequency table with the breaks
# (note: .table is a method; printing the classifier shows the same info)
print(fj5)
FisherJenks

   Interval      Count
----------------------
[ 3.50, 10.90] |    31
(10.90, 18.50] |    34
(18.50, 26.40] |    46
(26.40, 33.50] |    21
(33.50, 48.50] |    15
Recoding and labelling
We can use that information to create an ordinal variable:
# new variable
countriesCIA['tobacco_code']=fj5.yb
# create 'tobacco_levels' as a copy of 'tobacco_code'
countriesCIA=countriesCIA.assign(tobacco_levels=countriesCIA.tobacco_code)
Notice that I have two new columns with the same information.
Now, let's give labels to one of them:
# map of labels for the levels
newLevels={0:'1.very low',1:'2.low', 2:'3.average',3:'4.high', 4:'5.very high'}
# recoding
countriesCIA.replace({'tobacco_levels':newLevels}, inplace=True)
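A side note: after replace() the labels are plain strings, and they only sort correctly because of the '1.'-'5.' prefixes. If you want the order to be explicit, an ordered categorical is an alternative (sketch with toy codes):

```python
import pandas as pd

# toy classifier codes, as a classifier's .yb would provide them
codes = [0, 2, 4, 1]
labels = ['1.very low', '2.low', '3.average', '4.high', '5.very high']

# from_codes maps each integer code to its label, keeping the level order
levels = pd.Categorical.from_codes(codes, categories=labels, ordered=True)
print(levels.min(), levels.max())
# → 1.very low 5.very high
```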
This is our new plot using the diverging palette PiYG (I added _r to reverse the color order):
base=worldBorders.plot(color='lightgrey',
figsize=(15, 10))
countriesCIA.plot(column='tobacco_levels',ax=base,
cmap='PiYG_r', # reversed
categorical=True,
linewidth=0.,
legend=True,
legend_kwds={"title": "TOBACCO USE (%)\n(missing data in grey)"}
)
base.set_axis_off()
# locating the legend
leg = base.get_legend()
leg.set_bbox_to_anchor((0.2, 0.6)) #(left, bottom), 1 is max value (right/top)
plt.show();
I will share the R version for this map.
Custom bins
Remember that we should always consider custom-made bins instead of "automatically" detected ones. This is important when some thresholds are institutionalised. The process is simple:
- Check values:
countriesCIA.TobaccoUse_perc.describe()
count    147.000000
mean      20.278231
std        9.723848
min        3.500000
25%       11.900000
50%       20.800000
75%       26.100000
max       48.500000
Name: TobaccoUse_perc, dtype: float64
- Decide breaks:
customBreaks=[0,5,15,30,40,100]
pd.cut(countriesCIA.TobaccoUse_perc, customBreaks,ordered=True,include_lowest=True).value_counts(sort=False)
TobaccoUse_perc
(-0.001, 5.0]     3
(5.0, 15.0]      51
(15.0, 30.0]     66
(30.0, 40.0]     25
(40.0, 100.0]     2
Name: count, dtype: int64
- Save the new column with labels:
theCustomLabels=['1. below5','2.(5-15]','3.(15-30]','4. (30-40]' , '5. above40']
countriesCIA['tobacco_custom']=pd.cut(countriesCIA.TobaccoUse_perc,
customBreaks,
ordered=True,
include_lowest=True,
labels=theCustomLabels)
countriesCIA.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   COUNTRY               147 non-null    object  
 1   geometry              147 non-null    geometry
 2   name                  147 non-null    object  
 3   region                147 non-null    object  
 4   obesityAdults_rate    147 non-null    float64 
 5   TobaccoUse_perc       147 non-null    float64 
 6   Alcohol_LitersPerCap  147 non-null    float64 
 7   tobacco_code          147 non-null    int64   
 8   tobacco_levels        147 non-null    object  
 9   tobacco_custom        147 non-null    category
dtypes: category(1), float64(3), geometry(1), int64(1), object(4)
memory usage: 10.8+ KB
Interactive version
Interactive choropleths matter not because they may seem impressive, but because tooltips let you show values as you visit a polygon, and because you may need to zoom in and out constantly.
An interactive version follows:
countriesCIA.explore(
    column="tobacco_levels", # make the choropleth based on the "tobacco_levels" column
tooltip=["name","TobaccoUse_perc"], # what to show on hover
tiles="CartoDB positron", # use "CartoDB positron" tiles
cmap="PiYG_r", # colormap
style_kwds=dict(color="black"), # use black outline
legend_kwds={'caption':'Tobacco Use<br>(missing has no tooltip)'}
)
Behind the scenes, geopandas is using folium. The code I will share in R will use leaflet for interactive features.
Creating layers from the discrete values
Let me subset the map with CIA data:
countriesCIA_good=countriesCIA[countriesCIA.tobacco_code==0]
countriesCIA_mean=countriesCIA[countriesCIA.tobacco_code==2]
Now, plot them in one map:
fig, axes = plt.subplots(nrows=1,ncols=2,figsize=(15, 10))
# main title
fig.suptitle('Tobacco Use', size=16,y=0.7)
# map1
base1=worldBorders.plot(color='lightgrey',ax=axes[0])
countriesCIA_good.plot(color='green',ax=base1)
base1.set_title('The countries with lowest use')
base1.set_axis_off()
#map2
base2=worldBorders.plot(color='lightgrey',ax=axes[1])
countriesCIA_mean.plot(color='orange',ax=base2)
base2.set_title('The countries with average use')
base2.set_axis_off()
#display
plt.show();
I will share the R code for this last map, using facets.
Let me use ONE map to represent the previous one (with its two sub-maps) in an interactive way:
import folium # I need folium
base = countriesCIA_good.explore(
color='green',
tooltip=["name","TobaccoUse_perc"], # what to show on hover
name="good", # name of the layer in the map
)
countriesCIA_mean.explore(
m=base, # the 'ax'
tooltip=["name","TobaccoUse_perc"], # what to show on hover
color="orange",
name="average",
)
folium.LayerControl().add_to(base) # use folium to add layer control
base # show map
Currently we have:
countriesCIA.head()
| | COUNTRY | geometry | name | region | obesityAdults_rate | TobaccoUse_perc | Alcohol_LitersPerCap | tobacco_code | tobacco_levels | tobacco_custom |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | MULTIPOLYGON (((61.27656 35.60725, 61.29638 35... | Afghanistan | South Asia | 5.5 | 23.3 | 0.01 | 2 | 3.average | 3.(15-30] |
| 1 | Algeria | MULTIPOLYGON (((-5.15213 30.18047, -5.13917 30... | Algeria | Africa | 27.4 | 21.0 | 0.59 | 2 | 3.average | 3.(15-30] |
| 2 | Azerbaijan | MULTIPOLYGON (((46.54037 38.87559, 46.49554 38... | Azerbaijan | Middle East | 19.9 | 24.0 | 1.38 | 2 | 3.average | 3.(15-30] |
| 3 | Albania | MULTIPOLYGON (((20.79192 40.43154, 20.78722 40... | Albania | Europe | 21.7 | 22.4 | 4.40 | 2 | 3.average | 3.(15-30] |
| 4 | Armenia | MULTIPOLYGON (((46.54037 38.87559, 46.51639 38... | Armenia | Middle East | 20.2 | 25.5 | 3.77 | 2 | 3.average | 3.(15-30] |
Beyond choroplething
At every one of the previous stages you may need to make several assumptions and use a particular technique. In the end, you do all this work not to color a map nicely, but to provide interesting information that may start a decision-making process. Most of the time, this requires combining several variables to say something relevant.
Let's prepare the bins for the Alcohol data, and then propose something interesting.
# based on similarity, optimizer
fj5_alc = mapclassify.FisherJenks(countriesCIA.Alcohol_LitersPerCap, k=K)
# get the values
countriesCIA['alcohol_code']=fj5_alc.yb
# a copy in another column
countriesCIA=countriesCIA.assign(alcohol_levels=countriesCIA.alcohol_code)
# recoding
countriesCIA.replace({'alcohol_levels':newLevels}, inplace=True)
countriesCIA.head()
| | COUNTRY | geometry | name | region | obesityAdults_rate | TobaccoUse_perc | Alcohol_LitersPerCap | tobacco_code | tobacco_levels | tobacco_custom | alcohol_code | alcohol_levels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | MULTIPOLYGON (((61.27656 35.60725, 61.29638 35... | Afghanistan | South Asia | 5.5 | 23.3 | 0.01 | 2 | 3.average | 3.(15-30] | 0 | 1.very low |
| 1 | Algeria | MULTIPOLYGON (((-5.15213 30.18047, -5.13917 30... | Algeria | Africa | 27.4 | 21.0 | 0.59 | 2 | 3.average | 3.(15-30] | 0 | 1.very low |
| 2 | Azerbaijan | MULTIPOLYGON (((46.54037 38.87559, 46.49554 38... | Azerbaijan | Middle East | 19.9 | 24.0 | 1.38 | 2 | 3.average | 3.(15-30] | 0 | 1.very low |
| 3 | Albania | MULTIPOLYGON (((20.79192 40.43154, 20.78722 40... | Albania | Europe | 21.7 | 22.4 | 4.40 | 2 | 3.average | 3.(15-30] | 2 | 3.average |
| 4 | Armenia | MULTIPOLYGON (((46.54037 38.87559, 46.51639 38... | Armenia | Middle East | 20.2 | 25.5 | 3.77 | 2 | 3.average | 3.(15-30] | 1 | 2.low |
This is the distribution:
countriesCIA.alcohol_levels.value_counts()
alcohol_levels
1.very low     45
2.low          31
5.very high    26
4.high         24
3.average      21
Name: count, dtype: int64
Now that we have two variables, we could ask ourselves: which countries are in the top level (code 4, 'very high') in both tobacco and alcohol consumption?
interestingCountries=countriesCIA[countriesCIA.tobacco_code + countriesCIA.alcohol_code==8]
## or
# countriesCIA[(countriesCIA.tobacco_code==4) & (countriesCIA.alcohol_code==4)]
interestingCountries
| | COUNTRY | geometry | name | region | obesityAdults_rate | TobaccoUse_perc | Alcohol_LitersPerCap | tobacco_code | tobacco_levels | tobacco_custom | alcohol_code | alcohol_levels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | Bulgaria | MULTIPOLYGON (((28.01305 41.98222, 27.9711 41.... | Bulgaria | Europe | 25.0 | 39.0 | 11.18 | 4 | 5.very high | 4. (30-40] | 4 | 5.very high |
| 35 | Cyprus | MULTIPOLYGON (((33.27229 34.70955, 33.21722 34... | Cyprus | Europe | 21.8 | 35.1 | 9.59 | 4 | 5.very high | 4. (30-40] | 4 | 5.very high |
| 55 | Croatia | MULTIPOLYGON (((19.03972 44.86138, 19.02972 44... | Croatia | Europe | 24.4 | 36.9 | 9.64 | 4 | 5.very high | 4. (30-40] | 4 | 5.very high |
| 73 | Latvia | MULTIPOLYGON (((21.06861 56.43555, 21.05777 56... | Latvia | Europe | 23.6 | 37.0 | 12.90 | 4 | 5.very high | 4. (30-40] | 4 | 5.very high |
These are the countries:
base=worldBorders.plot(color='lightgrey')
interestingCountries.plot(ax=base)
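By the way, the sum trick used above is safe only because 4 is the highest possible code; for intermediate levels, the explicit & condition is the reliable one. A toy illustration:

```python
import pandas as pd

# made-up codes on a 0-4 scale for two variables
df = pd.DataFrame({'tob': [4, 3, 4], 'alc': [4, 3, 2]})

# a sum of 6 mixes (3,3) with (4,2); the AND keeps only (3,3)
by_sum = df[(df.tob + df.alc) == 6]
by_and = df[(df.tob == 3) & (df.alc == 3)]
print(len(by_sum), len(by_and))
# → 2 1
```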
Here we need to adapt the basemap to cover only an area like this:
countriesCIA[countriesCIA.tobacco_code + countriesCIA.alcohol_code==8].plot()
All those polygons are within a rectangle, let me find it:
maskToClip=interestingCountries.dissolve().envelope
maskToClip.plot(color='white',edgecolor='red')
I may use this rectangle to clip the world!
# new map
miniWorld=worldBorders.clip(maskToClip)
# then
base=miniWorld.plot(color='lightgrey')
interestingCountries.plot(ax=base)
plt.show()
It would be better if we could see the names. Here comes some labor:
- Get the coordinates for the country names
allCoords=[x.coords[:][0] for x in interestingCountries.representative_point()]
allCoords
[(25.170339327283575, 42.74881362915039), (33.39597016804631, 35.166385650634766), (15.35048927473903, 44.74440383911133), (24.453569548787208, 56.89241981506348)]
- Use those coordinates to create a new column:
interestingCountries=interestingCountries.assign(coordinates=allCoords)
interestingCountries
| | COUNTRY | geometry | name | region | obesityAdults_rate | TobaccoUse_perc | Alcohol_LitersPerCap | tobacco_code | tobacco_levels | tobacco_custom | alcohol_code | alcohol_levels | coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | Bulgaria | MULTIPOLYGON (((28.01305 41.98222, 27.9711 41.... | Bulgaria | Europe | 25.0 | 39.0 | 11.18 | 4 | 5.very high | 4. (30-40] | 4 | 5.very high | (25.170339327283575, 42.74881362915039) |
| 35 | Cyprus | MULTIPOLYGON (((33.27229 34.70955, 33.21722 34... | Cyprus | Europe | 21.8 | 35.1 | 9.59 | 4 | 5.very high | 4. (30-40] | 4 | 5.very high | (33.39597016804631, 35.166385650634766) |
| 55 | Croatia | MULTIPOLYGON (((19.03972 44.86138, 19.02972 44... | Croatia | Europe | 24.4 | 36.9 | 9.64 | 4 | 5.very high | 4. (30-40] | 4 | 5.very high | (15.35048927473903, 44.74440383911133) |
| 73 | Latvia | MULTIPOLYGON (((21.06861 56.43555, 21.05777 56... | Latvia | Europe | 23.6 | 37.0 | 12.90 | 4 | 5.very high | 4. (30-40] | 4 | 5.very high | (24.453569548787208, 56.89241981506348) |
Let's use that info:
base=miniWorld.plot(color='lightgrey')
interestingCountries.plot(color='yellow', edgecolor='black',ax=base)
for idx, row in interestingCountries.iterrows():
plt.annotate(text=row['name'], xy=row['coordinates'], horizontalalignment='center', color='blue')
You will find the code in R for this one too.
The interactive version does not need the clipped map:
interestingCountries.explore()
From here, you can start hypothesizing ideas where location matters (why are you using a map otherwise?).
Let me save this information in a GeoPackage file:
countriesCIA.to_file('maps/countriesCIA.gpkg', driver='GPKG', layer='cia')
worldBorders.to_file('maps/countriesCIA.gpkg', driver='GPKG', layer='border')
HOMEWORK #1
This homework requires a repo on GitHub, and you will use both Python and R.
A. In Python:
1. Work with the column on obesity, and find the best scheme for 5 bins.
2. Use that same scheme to bin the columns on _alcohol_ and _tobacco_.
3. Find the countries that are doing the best in all three variables.
B. In R or Python:
1. Plot the countries you found in the last step. ONLY plot the NON-interactive version.
The code for the first three steps has to be in one file named code_hw1. The code to plot the countries in part B has to be in another file. All the data has to be read from the GitHub PATH. For section B, create a file named index, and make sure you also create an HTML version with the same name. The result of section B will be published as a web page from GitHub.